Scalable telomere-to-telomere assembly for diploid and polyploid genomes with double graph

Despite recent advances in the length and the accuracy of long-read data, building haplotype-resolved genome assemblies from telomere to telomere still requires considerable computational resources. In this study, we present an efficient de novo assembly algorithm that combines multiple sequencing technologies to scale up population-wide telomere-to-telomere assemblies. By utilizing twenty-two human and two plant genomes, we demonstrate that our algorithm is around an order of magnitude cheaper than existing methods, while producing better diploid and haploid assemblies. Notably, our algorithm is the only feasible solution to the haplotype-resolved assembly of polyploid genomes.

1 Software commands

1.1 Filtering ultra-long reads
We discarded short ultra-long reads to avoid assembly errors and reduce the running time. For HPRC Year 2 samples, ultra-long reads shorter than 100 kb were filtered out using seqkit (version 2.3.0):

seqkit seq -m100000 <ultra-long-reads.fasta>

For HPRC Year 1 and plant samples, ultra-long reads shorter than 50 kb were removed:

seqkit seq -m50000 <ultra-long-reads.fasta>
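
To make the effect of the minimum-length filter concrete, the following is a minimal Python sketch of the same idea: keep only reads whose length is at least a given threshold. It is not how seqkit is implemented; the function names `read_fasta` and `filter_min_length` are hypothetical, and the input is assumed to be plain (uncompressed) FASTA text.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_min_length(
    records: Iterable[Tuple[str, str]], min_len: int
) -> Iterator[Tuple[str, str]]:
    """Keep records whose sequence is at least min_len bases long."""
    for header, seq in records:
        if len(seq) >= min_len:
            yield header, seq


if __name__ == "__main__":
    # Toy input; a real run would iterate over an open FASTA file instead.
    toy = [">read1", "ACGT", "ACGT", ">read2", "AC", ">read3", "ACGTAC"]
    for name, seq in filter_min_length(read_fasta(toy), min_len=6):
        print(f">{name}\n{seq}")
```

With a threshold of 100000 or 50000, this corresponds to the 100 kb and 50 kb cutoffs described above.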

Running with Terra using cloud computing
We performed the assemblies of human genomes using preemptible instances provided by the Google Cloud Platform. Hifiasm (UL) was divided into three steps to make full use of the preemptible instances.

Running asmgene
For human genome assemblies, we aligned the cDNAs to both the CHM13v2 reference genome and the assembled contigs with minimap2 (version 2.24-r1122), and evaluated gene completeness with paftools.js from the minimap2 package. The 'lineage dataset' was set to brassicales_odb10 and solanales_odb10 for the Arabidopsis and potato genome assemblies, respectively.

Phasing accuracy evaluation
For human genome assemblies, we used yak (version 0.1-r62-dirty) to measure the Hamming error rate and the switch error rate:

yak trioeval -t <nThreads> <paternal.yak> <maternal.yak> <asm_contig.fa>

For the potato genome assembly, we employed haplotype-specific HiFi reads as markers to assess the phasing errors.

Counting Telomere-to-Telomere (T2T) contigs
The HPRC workflow (https://github.com/biomonika/HPP/blob/main/assembly/wdl/workflows/assessAsemblyCompletness.wdl) was used to detect the T2T contigs. For human genome assemblies, CHM13v2 was set as the reference genome when running the workflow. For the Arabidopsis and potato genomes, all assemblies were aligned to the published genomes generated from the same datasets.
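
As a rough illustration of what "telomere-to-telomere" means at the sequence level, the sketch below checks whether a contig carries tandem telomeric repeats at both ends. This is not the HPRC workflow and does not reproduce its logic; the function names, the 1000 bp window, and the minimum of three tandem repeats are hypothetical choices made for the example. The default motif TTAGGG is the canonical human telomeric repeat, and other species would need a different motif.

```python
def revcomp(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    table = str.maketrans("ACGT", "TGCA")
    return seq.upper().translate(table)[::-1]


def has_tandem_repeat(region: str, motif: str, min_repeats: int) -> bool:
    """True if region contains at least min_repeats exact tandem copies of motif."""
    return motif * min_repeats in region.upper()


def looks_t2t(
    contig: str,
    motif: str = "TTAGGG",   # canonical human telomeric repeat
    window: int = 1000,      # hypothetical: bases inspected at each end
    min_repeats: int = 3,    # hypothetical: tandem copies required
) -> bool:
    """Heuristic: telomeric repeats present at both contig ends.

    Read 5'->3', a chromosome-length sequence starts with the reverse
    complement of the motif and ends with the motif itself. This holds
    for either strand orientation of the contig.
    """
    start, end = contig[:window], contig[-window:]
    return has_tandem_repeat(start, revcomp(motif), min_repeats) and has_tandem_repeat(
        end, motif, min_repeats
    )


if __name__ == "__main__":
    body = "ACGT" * 500
    complete = "CCCTAA" * 5 + body + "TTAGGG" * 5
    one_sided = "CCCTAA" * 5 + body
    print(looks_t2t(complete), looks_t2t(one_sided))
```

An exact-match heuristic like this ignores sequencing errors and degenerate repeats, and unlike the procedure described above it does not use alignment to a reference genome.
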
The phasing switch error rate refers to the proportion of adjacent haplotype-specific marker pairs originating from different haplotypes, while the phasing Hamming error rate represents the percentage of haplotype-specific markers that are incorrectly phased. For human genome assemblies, phasing errors were calculated using haplotype-specific 31-mers obtained from parental short reads with yak. For the potato genome assembly, we employed haplotype-specific HiFi reads as markers to assess the phasing errors.
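
The two definitions above can be illustrated with a small Python sketch. It is not yak's implementation. It assumes each contig has already been reduced to an ordered string of marker labels, 'p' for a paternal-specific marker and 'm' for a maternal-specific one, and it treats the minority label within each contig as the incorrectly phased markers. The function name `phasing_errors` is hypothetical.

```python
from typing import Sequence, Tuple


def phasing_errors(contigs: Sequence[str]) -> Tuple[float, float]:
    """Return (switch error rate, Hamming error rate) over all contigs.

    Each contig is a string of 'p'/'m' labels giving the parental origin
    of its haplotype-specific markers, in order along the contig.
    """
    switches = pairs = wrong = total = 0
    for markers in contigs:
        # Switch error: adjacent marker pairs with different origins.
        pairs += max(len(markers) - 1, 0)
        switches += sum(a != b for a, b in zip(markers, markers[1:]))
        # Hamming error (assumption): the minority haplotype is mis-phased.
        n_pat = markers.count("p")
        n_mat = markers.count("m")
        wrong += min(n_pat, n_mat)
        total += n_pat + n_mat
    switch_rate = switches / pairs if pairs else 0.0
    hamming_rate = wrong / total if total else 0.0
    return switch_rate, hamming_rate


if __name__ == "__main__":
    # One contig with a single stray maternal marker, one clean contig.
    switch_rate, hamming_rate = phasing_errors(["ppppmppp", "mmmm"])
    print(f"switch={switch_rate:.3f} hamming={hamming_rate:.3f}")
```

A single isolated wrong marker contributes two switches but only one Hamming error, whereas a long mis-joined block contributes one switch and many Hamming errors, which is why the two rates are reported together.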